Abstract

This study poses the feature correspondence problem 
as a hypergraph node labeling problem. Candidate fea- 
ture matches and their subsets (usually of size larger 
than two) are considered to be the nodes and hyper- 
edges of a hypergraph. A hypergraph labeling algorithm, 
which models the subset-wise interaction by an undi- 
rected graphical model, is applied to label the nodes (fea- 
ture correspondences) as correct or incorrect. We de- 
scribe a method to learn the cost function of this label- 
ing algorithm from labeled examples using a graphical 
model training algorithm. The proposed feature matching
algorithm differs from most existing learning-based point
matching methods in the form of the objective function, the
cost function to be learned, and the optimization method
applied to minimize it. The results on standard datasets demonstrate
how learning over a hypergraph improves the matching 
performance over existing algorithms, notably one that 
also uses higher order information without learning. 



1. Introduction 

Identifying feature correspondence is an important
problem in computer vision (see references in [8]). In
general, matching features using only the appearance
descriptor values often results in many incorrect
matches. To address this problem, most algorithms
for feature correspondence combine information about
both appearance and geometric structure among the
feature locations. Several methods [2, 18, 15, 7, 10]
utilize pairwise geometric consistency, along with
pointwise descriptor similarity, to design a matching
cost function that is minimized using various
optimization algorithms. For example, [18, 7, 15] use
spectral techniques to compute a 'soft' assignment vector
that is later discretized to produce the correct
assignment of features. These works model the appearance
and pairwise geometric similarity using a graph, either
explicitly or implicitly, and are commonly known as
graph matching algorithms. The soft assignment vector
is typically computed by an eigen-decomposition of
the compatibility or match quality matrix. Several
studies applied graph matching algorithms for various 
vision problems [TB]. 

Caetano et al. [5] discuss how the parameters of
the matching cost function (primarily the match
compatibility scores) can be learned from pairs with
labeled correspondences to maximize the matching
accuracy. A more recent work [16] proposes to learn similar
matching scores in an unsupervised fashion by
repeatedly refining the soft assignment vector.

Higher order relationships among the feature points
have also been investigated as a means of improving
the matching accuracy. Zass et al. [21] assume two
separate hypergraphs among the feature points on two
images and propose an iterative algorithm to match
the two hypergraphs. On the other hand, Olivier
et al. [8] generalize the pairwise spectral graph
matching methods to higher order relationships among the
point matches. The pairwise score matrix is generalized
to a high order compatibility tensor. The eigenvectors
of this tensor are used as the soft assignment
matrix to recover the matches.

In our framework, each feature correspondence is 
considered as a datapoint and we assume a hypergraph 
structure among these datapoints (similar to ^). That 
is, we conceive a subset of candidate feature matches 
as a hyperedge of the hypergraph. For subsets of such 
datapoints, we assume that the relationship among fea- 
tures of one image follows the same geometrical model 
as that present among the corresponding features in the 
other image. We compute the likelihood, using this
geometrical model, for every subset of datapoints and use
it as the weight of the hyperedge. The objective is to label
the datapoints, i.e., the matches, as correct or incorrect,
given this hypergraph structure among them.

We adopt a hypergraph node labeling algorithm
proposed in [17]. Given a hypergraph, where the
hyperedge weights are computed using a model, this
algorithm produces the optimal labeling of the nodes that



maximally conforms with the hyperedge weights or 
likelihood values. Within the framework, the higher 
order interaction among subsets of datapoints is mod- 
eled using a higher order undirected graphical model 
or the Markov network (see ^J ^^^ details). The la- 
bels are computed by solving the inference problem on 
this graphical model where a labeling cost or energy 
function is minimized to produce the optimal labeling. 

In this paper, we show that the framework of
hypergraph node labeling of [17] can be applied to feature
matching. In addition, we show how it is possible, and
in fact advantageous, to learn (a parametric form of)
the cost function for matching given several labeled
examples of feature correspondences. The learned forms
of cost functions are able to appropriately weight the
label disagreement cost for different subsets. For
example, if the subsets containing more accurate matches
than inaccurate ones are relatively few, the associated
penalty function will attain a higher weight to balance
the relative importance. The learning procedure is
general, i.e., in addition to feature matching, it can be
utilized for any application of the labeling problem [17].

Point matching problem was addressed by a proba- 
bilistic graphical model before, in jHITS], enforcing a 
graph among the points for spatial consistency. The 
required potential (cost) functions in these two studies 
were pre-selected and not learned from the data. Our 
approach can handle match interaction in larger sets 
and demonstrates the advantage of learning the cost 
functions from the data. Feature matching problem 
has also been cast as an energy minimization problem 
in ^. 

1.1. Contribution: 

At this point, we would like to clarify what aspect 
of learning (hypergraph labeling for) point matching is 
different from earlier works. Let us suppose $x_i \in \{0, 1\}$
is the label for the $i$-th candidate feature match: $x_i = 1$
implies a correct match and $x_i = 0$ implies an incorrect
one. Let $H_{V^k}$ be the match compatibility score
of a subset $V^k$ of matches of size $k$. The popular
graph and tensor matching algorithms maximize the 
following overall matching score to retrieve the correct 
matches ^HlTllH]. 



$$S(X) = \sum_{V^k} H_{V^k} \, s(V^k). \qquad (1)$$

The score function is a weighted summation of the subset-wise
label concurrence function $s(V^k) = \prod_{j=1}^{k} x_{i_j}$.
Notice that $s(V^k)$ is a binary valued function: $s(V^k) = 1$
only when all labels $x_{i_1}, \ldots, x_{i_k}$ are equal to 1, and
0 otherwise. Instead of using this predefined binary
valued function, we investigate whether such a label
agreement function (or, conversely, a disagreement cost
function) can be learned from labeled matches. We believe
it is particularly useful to learn this function for higher
order ($k > 2$) methods. To illustrate the necessity of such
learning, we show two images in Figure 1 with candidate feature
matches $(D_1, F_1)$, $(D_2, F_2)$, $(D_3, F_3)$ and $(D_3, F_4)$, all
with equal matching probability, overlaid on them.




Figure 1. Triangle pairs with overlapping matches. 



It is assumed that the geometrical arrangement
among matching features can be encoded by triangles.
Clearly, the similarity between triangles $D_1 D_2 D_3$
(red) and $F_1 F_2 F_3$ (green) will be high, resulting
in a large match compatibility $H_{V^3}$ (where $V^3 =
\{(D_1, F_1), (D_2, F_2), (D_3, F_3)\}$). Notice that the
triangle $F_1 F_2 F_4$ (blue dashed) would also have relatively
large similarity with $D_1 D_2 D_3$. Though this subset
$\{(D_1, F_1), (D_2, F_2), (D_3, F_4)\}$ of matches contains one
incorrect match $(D_3, F_4)$, it still provides significant
geometric information about the two correct matches
$(D_1, F_1)$ and $(D_2, F_2)$. Incorporating this information
in the algorithm should assist in establishing more
correct correspondences among the features. However,
the form $s(V^k) = \prod_{j=1}^{k} x_{i_j}$ does not explicitly handle
this situation, even when $x_{i_j}$ is relaxed to take values in
the real domain.^1 One needs to learn an appropriate label
agreement (or disagreement cost) function to explicitly
include this information in the framework. Learning
the cost function can also counteract the uneven ratio
of subsets with more correct matches and those with
more incorrect matches.
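
The effect described above can be checked with a small numerical sketch. The compatibility scores below are illustrative assumptions, not values from the experiments, and `concurrence` is a hypothetical helper name for $s(V^k)$.

```python
from math import prod

def concurrence(labels):
    """Binary label concurrence s(V^k): the product of the labels in one subset."""
    return prod(labels)

# Labels for the two triangles of Figure 1 (1 = correct match, 0 = incorrect).
all_correct = [1, 1, 1]    # (D1,F1), (D2,F2), (D3,F3)
one_incorrect = [1, 1, 0]  # (D1,F1), (D2,F2), (D3,F4)

# Assumed compatibility scores H for the two subsets (illustrative only).
H_all_correct, H_one_incorrect = 0.9, 0.8

score_a = H_all_correct * concurrence(all_correct)
score_b = H_one_incorrect * concurrence(one_incorrect)
print(score_a, score_b)
```

The second subset contributes nothing to the score even though its compatibility is high, which is the information a learned disagreement cost is meant to recover.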

As will be explained in detail later, to determine
the correspondence we in fact minimize a cost function
of the following form:

$$\sum_{V^k} \left[ H_{V^k} \, g_1(x_{i_1}, \ldots, x_{i_k}) + (1 - H_{V^k}) \, g_0(x_{i_1}, \ldots, x_{i_k}) \right]. \qquad (2)$$

This paper describes how to learn appropriate subset-wise
label disagreement cost functions (also referred to
as penalty functions) $g_1$ and $g_0$ from labeled matches.
Our approach is significantly different in concept from
previous learning algorithms for correspondence. The



^1 For binary $x_{i_j}$, $s(V^k) = 0$ with one incorrect match in $V^k$,
and therefore the compatibility score is ignored.



algorithms of [5, 16] aim to learn a match compatibility
function $H_{V^k}$ from the data to optimally reflect
accurate correspondences among the features. On the
contrary, our algorithm learns the label disagreement
cost functions $g_1$ and $g_0$ to minimize the total label
disagreements within the subsets given the subset matching
qualities $H_{V^k}$. The next section describes how feature
correspondence can be cast as a hypergraph labeling
problem as defined in [17].

2. Matching as hypergraph labeling

Given two images $I_l$ and $I_r$, we denote $a_l$ and $a_r$
to be the indices of feature points from $I_l$ and $I_r$
respectively. In general, the number $n_l$ of features in
$I_l$ is different from the number $n_r$ of features in $I_r$.
Each candidate match $(a_l, a_r)$ is considered to be a
datapoint $v_i$, $i = 1, \ldots, n$, in our approach. The goal is
to partition the dataset $V = \{v_1, \ldots, v_n\}$ into a subset
$A$ comprising correct correspondences and a subset $B$
comprising incorrect ones. This is a data labeling problem
where the binary label $x_i \in \{0, 1\}$ of $v_i$ needs to be
assigned $x_i = 1$ if $v_i$ belongs to $A$ and 0 otherwise.

We wish to exploit the information about subsets
of datapoints to enforce geometric consistency
in matching. More specifically, for a subset $V^k =
\{v_{i_1}, \ldots, v_{i_k}\} = \{(a_{l_1}, a_{r_1}), \ldots, (a_{l_k}, a_{r_k})\}$ of size $k$ of
matching points, we assume the geometric relationship
among $\{a_{l_1}, \ldots, a_{l_k}\}$ to be similar to that among
$\{a_{r_1}, \ldots, a_{r_k}\}$. This similarity value (computed by a
suitable function) is denoted by $\lambda(V^k) \in [0, 1]$. Notice
that we are effectively dealing with a hypergraph
with datapoints $v_i$ as the nodes and the subsets $V^k$
as the hyperedges. Given such a hypergraph, the labeling
algorithm is supposed to partition the set of nodes
into two sets $A$ and $B$, corresponding to correct and
incorrect matches respectively. We will use the terms
likelihood value and weight interchangeably when
referring to the similarity value $\lambda(V^k)$.
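
A minimal sketch of this representation is given below. The names `Hyperedge`, `build_hypergraph` and `toy_similarity` are hypothetical, and the similarity used here is only a placeholder (1.0 for one-to-one subsets, 0.0 otherwise) standing in for the geometric likelihood described later.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Match = Tuple[int, int]  # (index a_l in the left image, index a_r in the right image)

@dataclass(frozen=True)
class Hyperedge:
    nodes: Tuple[int, ...]  # indices i_1..i_k of the member datapoints v_i
    weight: float           # similarity lambda(V^k) in [0, 1]

def build_hypergraph(matches: Sequence[Match], k: int,
                     similarity: Callable[[List[int], List[int]], float]) -> List[Hyperedge]:
    """Treat every size-k subset of candidate matches as one hyperedge."""
    edges = []
    for nodes in combinations(range(len(matches)), k):
        left = [matches[i][0] for i in nodes]
        right = [matches[i][1] for i in nodes]
        edges.append(Hyperedge(nodes, similarity(left, right)))
    return edges

def toy_similarity(left: List[int], right: List[int]) -> float:
    # Placeholder weight (an assumption): 1.0 if the subset is one-to-one, else 0.0.
    one_to_one = len(set(left)) == len(left) and len(set(right)) == len(right)
    return 1.0 if one_to_one else 0.0

# Four candidate matches; the last two compete for the same left feature.
matches = [(0, 0), (1, 1), (2, 2), (2, 3)]
edges = build_hypergraph(matches, k=3, similarity=toy_similarity)
```

Enumerating all subsets is only practical for toy inputs; the experiments restrict hyperedges to triangles among nearest neighbors.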

The work in [17] models the higher order interac- 
tions in this hypergraph by a Markov network (by a 
Conditional Random Field (CRF) to be precise) [9]. 
The optimal labeling can then be achieved by solving 
the inference for this CRF model. We follow this rep- 
resentation which is described in the next section. 

3. The cost function 

Let $\mathcal{V}^k$ be the set of all hyperedges $V^k$ in this
hypergraph. Let $X = \{x_1, \ldots, x_n\}$ be a label assignment
of the nodes $V$ of the hypergraph. The cost function
that asserts the discrepancy of node assignments $X$ in the
hypergraph nodes $V$ can be written as

$$\mathcal{E}(X, V) = \sum_{V^k \in \mathcal{V}^k} E^k(X^k; V^k), \qquad (3)$$

where $X^k$ is the set of labels of the member nodes of subset
$V^k$ and $E^k$ is the local discrepancy, i.e., the cost
of assignment $X^k$ in $V^k$. We assume functionally
homogeneous local costs, $E^k = E$. Given this representation,
it is possible to construct an equivalent CRF
with clique potentials $E^k$ (see [17]) and formulate the
optimal assignment task as inference in this CRF.

Following [17], each clique potential $E$ is represented as

$$E(X^k; V^k) = \beta_1 \, \lambda(V^k) \, g_1(\eta_0) + \beta_0 \, (1 - \lambda(V^k)) \, g_0(\eta_1). \qquad (4)$$

Here, $g_c$, $c = 0, 1$, are penalty functions:
the cost of assigning clique nodes to an incorrect class
(e.g., match to non-match and vice versa). Each penalty
function is defined as a function of $\eta_{1-c}$, the number of
nodes in the clique whose label differs from the clique
hypothesis $c$. The $\beta_c$ are non-negative balancing parameters
with $\beta_0 + \beta_1 = 1$.

Intuitively, this potential penalizes, via the functions $g_c$,
the label assignments incompatible with one of the two
hypotheses, matching and non-matching features. To
achieve this, the penalties $g_c$ should be non-decreasing
in $\eta_{1-c}$. If the likelihood of matching, $\lambda(V^k)$, is high,
the potential seeks to decrease $\eta_0$, the number of
assignments to the "not-matching" hypothesis. In the
opposite case, with high non-matching likelihood $1 - \lambda(V^k)$,
the potential attempts to decrease the number of labels
incompatible with this hypothesis, $\eta_1$.
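
As a rough illustration of Equation 4, the sketch below evaluates the clique potential for a few label assignments. The linear penalty and the value $\beta_1 = 0.5$ are assumptions chosen only for the example, and `clique_potential` is a hypothetical name.

```python
def clique_potential(labels, lam, g1, g0, beta1=0.5):
    """E = beta1 * lam * g1(eta0) + beta0 * (1 - lam) * g0(eta1) for one clique."""
    eta0 = labels.count(0)  # labels disagreeing with the "match" hypothesis c = 1
    eta1 = labels.count(1)  # labels disagreeing with the "non-match" hypothesis c = 0
    beta0 = 1.0 - beta1     # balancing parameters sum to one
    return beta1 * lam * g1(eta0) + beta0 * (1.0 - lam) * g0(eta1)

def linear(eta):
    # Predefined linear penalty, used here only as a stand-in for a learned g_c.
    return float(eta)

lam = 0.9  # assumed high matching likelihood for this clique
e_all_match = clique_potential([1, 1, 1], lam, linear, linear)
e_one_off = clique_potential([1, 1, 0], lam, linear, linear)
e_none_match = clique_potential([0, 0, 0], lam, linear, linear)
print(e_all_match, e_one_off, e_none_match)
```

With a high likelihood of matching, the cost grows as more nodes in the clique are labeled 0, which is the behavior the text describes.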

The penalty functions $g_c$ could be directly modeled as
linear or nonlinear functions of the number of label
disagreements $\eta_{1-c}$ in the clique. However, as will
become clear later, it is advantageous to learn a nonlinear
mapping $g_c$ from labeled data. The next section
describes how the functions $g_c$ can be learned from
labeled matches/mismatches.

4. Learning penalty functions 

Given $J$ hypergraphs with hyperedges $\mathcal{V}^k_j$, $j =
1, \ldots, J$, along with the weights and labels $X_j$ of the
datapoints (or correspondences), we wish to learn the
parametric form of the $g_c$ functions. We first describe
two parametric forms of the penalty functions so that
the clique potentials, as defined in Equation 4, become
log-linear models. In particular, we seek to express the
potential as a linear combination of factors defined over
each clique [9]:

$$E(X^k; V^k) = -\sum_{l} w_l \, \phi_l(X^k; V^k). \qquad (5)$$

In this definition, $\phi_l(X^k; V^k)$ are the factors and $w_l$
are the mixing weights. The following sections explain
how restating the penalty functions in this manner
facilitates learning using CRF training algorithms.

4.1. Discrete $g_c$

First, we express $g_c$ as a discrete function. Observe
that the penalty functions $g_c$ are defined on $\eta_{(1-c)}$
values, which are integers in our case. Therefore, it
suffices to learn a set of discrete mappings $g_c(\eta_{(1-c)})$ for
all $c \in \{0, 1\}$ and $0 \le \eta_{(1-c)} \le k$. Let us introduce two
quantities as follows:

$$w_c^{\alpha} = \beta_c \, g_c(\alpha), \qquad (6)$$

$$\phi_c(\alpha; V^k) = -\lambda_c(V^k) \, I(\eta_{(1-c)}, \alpha), \qquad (7)$$

where $I(s, t)$ is an indicator function which equals 1
only when $s$ is equal to $t$ and 0 otherwise. Furthermore,
the likelihood weights are denoted by $\lambda_1(V^k) = \lambda(V^k)$
and $\lambda_0(V^k) = 1 - \lambda(V^k)$ for notational convenience.
Notice that, in this case, the $\phi_c$ functions are the factors
(for each clique) that assume nonzero values only when
$\eta_{(1-c)} = \alpha$. The clique cost function defined in
Equation 4 can be rewritten as follows:

$$E(X^k; V^k) = -\sum_{c=0}^{1} \sum_{\alpha=0}^{k} w_c^{\alpha} \, \phi_c(\alpha; V^k). \qquad (8)$$

This definition of $g_c$ expresses the joint probability
of any assignment as a log-linear model. For this form of
$g_c$, the values of $w_c^{\alpha}$ are learned for all $\alpha = 1, \ldots, \eta_{(1-c)}$
and $c = 0, 1$.
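
The equivalence between Equation 4 and its log-linear form can be verified numerically. The sketch below is a check under assumed values: the quadratic penalty, $\beta_c = 0.5$ and the helper names are illustrative and not taken from the experiments.

```python
def discrete_features(labels, lam, k):
    """phi[c][alpha] = -lambda_c * I(eta_{1-c}, alpha) for a single clique."""
    lam_c = {1: lam, 0: 1.0 - lam}
    phi = {}
    for c in (0, 1):
        eta = labels.count(1 - c)  # number of labels differing from hypothesis c
        phi[c] = [-lam_c[c] * (1.0 if eta == alpha else 0.0)
                  for alpha in range(k + 1)]
    return phi

def weights_from_penalty(g, beta, k):
    """w[c][alpha] = beta_c * g_c(alpha), one weight per disagreement count."""
    return {c: [beta[c] * g[c](alpha) for alpha in range(k + 1)] for c in (0, 1)}

def loglinear_energy(phi, w):
    """E = -sum_c sum_alpha w[c][alpha] * phi[c][alpha]."""
    return -sum(w[c][a] * phi[c][a] for c in (0, 1) for a in range(len(phi[c])))

labels, lam, k = [1, 1, 0], 0.9, 3
g = {0: lambda a: a ** 2, 1: lambda a: a ** 2}  # assumed penalty shape
beta = {0: 0.5, 1: 0.5}

e_loglinear = loglinear_energy(discrete_features(labels, lam, k),
                               weights_from_penalty(g, beta, k))
# Direct evaluation of the clique potential for comparison.
e_direct = (beta[1] * lam * g[1](labels.count(0))
            + beta[0] * (1.0 - lam) * g[0](labels.count(1)))
print(e_loglinear, e_direct)
```

Only one indicator per class is active for a given clique, so the double sum selects exactly the two penalty values that appear in the direct form.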

4.2. Second order polynomial $g_c$

Unconstrained forms of $g_c$ may be prone to overfitting.
We thus propose a more constrained $g_c$ by assuming
a second order polynomial form for it. In this
case, this function can be expressed using the Taylor
expansion around the reference point 0:

$$g_c(\alpha) = g_c^{(0)} + \alpha \, g_c^{(1)} + \frac{\alpha^2}{2} \, g_c^{(2)}. \qquad (9)$$

In Equation 9, $g_c^{(0)}$, $g_c^{(1)}$ and $g_c^{(2)}$ are the 0th, 1st and 2nd
order derivatives of $g_c$ at 0. The features for this case
can be defined as

$$\phi_c^{0}(V^k) = -\sum_{\gamma=0}^{k} \lambda_c(V^k) \, I(\eta_{(1-c)}, \gamma), \qquad (10)$$

$$\phi_c^{1}(V^k) = -\sum_{\gamma=1}^{k} \gamma \, \lambda_c(V^k) \, I(\eta_{(1-c)}, \gamma), \qquad (11)$$

$$\phi_c^{2}(V^k) = -\sum_{\gamma=1}^{k} \frac{\gamma^2}{2} \, \lambda_c(V^k) \, I(\eta_{(1-c)}, \gamma). \qquad (12)$$

Then, the cost function in Equation 4 can be expressed
as a linear combination of the features $\phi_c^{e}$, $e = 0, \ldots, 2$.

For polynomial $g_c$, we learn the values of $g_c^{(e)}$ for all
$e = 0, 1, 2$ and $c = 0, 1$. This redefinition of $g_c$ has the
benefit of regulating the learned form to be of some
specific type. Also, regardless of the size $k$ of the data
subset, we only need to learn $3 \times C$ parameters, where
$C$ is the total number of classes. The next section briefly
discusses existing techniques for learning CRFs.
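
The polynomial parameterization above can be sketched as follows. The coefficient values and function names are assumptions made for illustration; the check confirms that the three features reproduce $\lambda_c \, g_c(\eta_{(1-c)})$ for one clique and one class.

```python
def poly_features(labels, lam_c, c, k):
    """Features (phi^0, phi^1, phi^2) of one clique for class hypothesis c."""
    eta = labels.count(1 - c)  # number of labels differing from hypothesis c
    ind = [1.0 if eta == gamma else 0.0 for gamma in range(k + 1)]
    phi0 = -sum(lam_c * ind[gamma] for gamma in range(0, k + 1))
    phi1 = -sum(gamma * lam_c * ind[gamma] for gamma in range(1, k + 1))
    phi2 = -sum(0.5 * gamma ** 2 * lam_c * ind[gamma] for gamma in range(1, k + 1))
    return phi0, phi1, phi2

def poly_penalty(alpha, coeffs):
    """Second order Taylor form g(alpha) = g0 + alpha * g1 + alpha^2 / 2 * g2."""
    d0, d1, d2 = coeffs
    return d0 + alpha * d1 + 0.5 * alpha ** 2 * d2

coeffs = (0.0, 1.0, 0.5)               # assumed derivative values at 0
labels, lam_c, c, k = [1, 0, 0], 0.8, 1, 3

phi = poly_features(labels, lam_c, c, k)
cost_from_features = -sum(d * f for d, f in zip(coeffs, phi))
cost_direct = lam_c * poly_penalty(labels.count(1 - c), coeffs)
print(cost_from_features, cost_direct)
```

Whatever the subset size, only three coefficients per class enter the cost, which is the reduction in parameters noted above.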

4.3. Learning algorithms 

In the last two sections we have shown that the clique
potential function of the proposed framework can be
expressed as a linear combination of features or factors.
The joint probability of any label configuration for a
CRF, with the discrete form of $g_c$, can be stated as follows:

$$p(X \mid V) = \frac{1}{Z(V)} \exp\Big( \sum_{V^k \in \mathcal{V}^k} \sum_{c=0}^{1} \sum_{\alpha=0}^{k} w_c^{\alpha} \, \phi_c(\alpha; V^k) \Big), \qquad (13)$$

where $Z(V)$ is a normalizing term,

$$Z(V) = \sum_{X} \exp\Big( \sum_{V^k \in \mathcal{V}^k} \sum_{c=0}^{1} \sum_{\alpha=0}^{k} w_c^{\alpha} \, \phi_c(\alpha; V^k) \Big). \qquad (14)$$

The joint probability will be similar
for the second order polynomial $g_c$, and we omit
the derivation for it here. There are two types of
algorithms to estimate the parameters $w_c^{\alpha}$ from data: one
that aims at determining the parameters by maximizing
the log-likelihood [9] and the other that maximizes
the separation, or the label margin, between classes of
datapoints [1].

4.3.1 Likelihood Maximization 

The log-likelihood function for the training data is
given by

$$l(w) = \sum_{j=1}^{J} \sum_{V^k \in \mathcal{V}^k_j} \sum_{c=0}^{1} \sum_{\alpha=0}^{k} w_c^{\alpha} \, \phi_c(\alpha; V^k) - \log Z(V). \qquad (15)$$

It has been shown that $l(w)$ is concave [9]. Therefore, a
Gradient Ascent algorithm is able to produce the globally
optimal values for $w_c^{\alpha}$. It is straightforward to see
that the gradient with respect to $w_c^{\alpha}$ is the difference
between the summations of the observed and expected $\phi_c(\alpha)$
values:

$$\frac{\partial l}{\partial w_c^{\alpha}} = \sum_{j} \sum_{V^k} \phi_c(\alpha; V^k) - \sum_{j} \sum_{V^k} \sum_{X^k} \phi_c(\alpha; V^k) \, p(X^k \mid V^k). \qquad (16)$$

We used a sum-product belief propagation
algorithm [14] to compute the marginal posteriors
$p(X^k \mid V^k)$. A regularizer term was added to the
likelihood function to penalize large parameter values.
Apart from Gradient Ascent, other algorithms such as
Conjugate Gradient and L-BFGS have also been used for
this maximization problem [9].
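
The update in Equation 16 can be illustrated on a toy problem. The sketch below is not the training code used in the experiments: it replaces belief propagation with brute-force enumeration of all labelings, which is feasible only for a handful of nodes, and the hyperedge weights, step size and L2 regularizer strength are assumed values.

```python
import itertools
import math

def feature_vector(labels, edges, k):
    """Sum of phi_c(alpha; V^k) over hyperedges, flattened at index c*(k+1)+alpha."""
    f = [0.0] * (2 * (k + 1))
    for nodes, lam in edges:
        sub = [labels[i] for i in nodes]
        for c, lam_c in ((0, 1.0 - lam), (1, lam)):
            eta = sub.count(1 - c)
            f[c * (k + 1) + eta] -= lam_c
    return f

def log_likelihood_and_grad(w, observed, edges, n, k, reg=0.01):
    """Exact regularized log-likelihood and gradient via enumeration of labelings."""
    configs = list(itertools.product((0, 1), repeat=n))
    feats = [feature_vector(x, edges, k) for x in configs]
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in feats]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    probs = [math.exp(s - log_z) for s in scores]
    f_obs = feature_vector(observed, edges, k)
    expected = [sum(p * f[d] for p, f in zip(probs, feats)) for d in range(len(w))]
    ll = (sum(wi * fi for wi, fi in zip(w, f_obs)) - log_z
          - 0.5 * reg * sum(wi * wi for wi in w))
    # Gradient: observed features minus expected features (minus the regularizer).
    grad = [fo - fe - reg * wi for fo, fe, wi in zip(f_obs, expected, w)]
    return ll, grad

def gradient_ascent(observed, edges, n, k, steps=300, lr=0.1):
    w = [0.0] * (2 * (k + 1))
    history = []
    for _ in range(steps):
        ll, grad = log_likelihood_and_grad(w, observed, edges, n, k)
        history.append(ll)
        w = [wi + lr * gi for wi, gi in zip(w, grad)]
    return w, history

# Toy hypergraph: 4 candidate matches, 4 triangles with assumed likelihoods.
edges = [((0, 1, 2), 0.9), ((0, 1, 3), 0.7), ((1, 2, 3), 0.2), ((0, 2, 3), 0.3)]
observed = (1, 1, 1, 0)
w, history = gradient_ascent(observed, edges, n=4, k=3)
print(history[0], history[-1])
```

Because the objective is concave, a small fixed step is enough here; for realistic problem sizes the expectation would have to come from an approximate inference routine rather than enumeration.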



4.3.2 Margin maximization

The second type of algorithm estimates the parameters
by maximizing the class margin of the labeled
examples. Margin maximization is useful if the data
distribution is biased toward one of the classes or if there are
many noisy samples in the data. Bartlett et al. [1]
proposed a constrained optimization problem, in terms of
the primal variables $w_c^{\alpha}$, for parameter learning in the
maximal margin setting. Their formulation minimizes a
loss function, defined in terms of the number of
incorrectly labeled examples, plus a regularizer term. An
exponentiated gradient (EG) algorithm is applied to
minimize the objective; it updates the primal variables
$w_c^{\alpha}$ similarly to Equation 16. In addition, the
EG algorithm also updates the dual variables to
minimize the subset-wise mislabeling error. Furthermore,
the marginal terms are different from those in
likelihood maximization: in [1], they are calculated
from a Markov network where the dual variables act as
potential functions.

More efficient versions of both these algorithms have
been described in [6]. In our experiments, parameters
were learned by standard Gradient Ascent optimization
to maximize the likelihood for a discrete $g_c$.

5. Inference

Once $g_c(\cdot)$, $c \in \{0, 1\}$, are learned, problems with
nonlinear $g_c(\cdot)$ can be solved using any efficient Markov
network inference algorithm. See [TTJ |20], and
references therein. We adopted sum-product belief
propagation [14] since we also use it for computing the
marginal probabilities $p(X^k \mid V^k)$ required to learn the
parameters. The output of this algorithm is the belief
(approximate marginal probability) $b_i(1)$ and $b_i(0)$ that
any datapoint $v_i$ belongs to class 1 and 0 respectively.

The belief values for each datapoint $v_i$ can be used
to determine the hard one-to-one assignment of any
feature $a_l$ of image $I_l$ to its unique match $a_r$ on image
$I_r$. To do this, for each $a_l$, we select the match
corresponding to the datapoint with the largest ratio of the two
beliefs, $b_i(1) / b_i(0)$, among all the datapoints associated with
$a_l$. The accompanying feature $a_r$ on the right image is
selected as the resultant match for $a_l$. This method of
discretization is similar to [15].

6. Experiments and Results

This section describes different matching experiments
conducted on standard datasets to test the proposed
method and compares the performances with
past studies. For all the experiments, the penalty
functions were learned using Gradient Ascent to maximize
the likelihood for a discrete mapping $g_c$ (Section 4.1).

6.1. House, Hotel and Horse data

We conduct our first experiment on the standard
House and Hotel datasets. Each of these datasets
contains a sequence of (around 100) images of a toy house
(or hotel) seen from increasingly varying viewpoints.
Locations of a set of keypoints that appear in each
image of the sequence are available for both
these sequences.

Another synthetic dataset, namely the silhouette
images of a Horse as used in [5], was also included
in this experiment. From a single silhouette image,
two sequences of 200 images were generated by shearing
and rotating. The width of the image is sheared
to at most twice its height, and the maximum angle
of rotation was 90 degrees. These image transformations
are different from those present in the House and
Hotel datasets. The feature locations are extracted by a
sampling method as in [5].

For the proposed algorithm, the Geometric Blur
(GB) [3] descriptor is used to represent each feature.
For each keypoint $a_l$ in image $I_l$, $m = 3$ candidate
matches, denoted by the set $\mu(a_l)$, are chosen based
on the largest normalized correlation between the GB
descriptors. Each of the candidate matches is considered
to be a datapoint $v_i$.
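
A minimal sketch of this candidate selection step follows. It assumes that normalized correlation means the mean-subtracted, norm-scaled inner product, and it uses short plain lists as stand-ins for Geometric Blur descriptors; the function names are hypothetical.

```python
import math

def normalized_correlation(u, v):
    """Mean-subtracted, norm-scaled inner product of two descriptors."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    du = [a - mu_u for a in u]
    dv = [b - mu_v for b in v]
    den = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    if den == 0.0:
        return 0.0  # constant descriptor: correlation is undefined, treat as 0
    return sum(a * b for a, b in zip(du, dv)) / den

def candidate_matches(left_desc, right_desc, m=3):
    """For each left keypoint, the m right keypoints with the largest correlation."""
    candidates = []
    for u in left_desc:
        scored = [(normalized_correlation(u, v), j) for j, v in enumerate(right_desc)]
        scored.sort(key=lambda s: s[0], reverse=True)
        candidates.append([(j, score) for score, j in scored[:m]])
    return candidates

# Toy descriptors (illustrative values only).
left = [[1.0, 2.0, 3.0, 4.0]]
right = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0],
         [1.0, 2.0, 3.0, 5.0], [1.0, 1.0, 2.0, 1.0]]
cands = candidate_matches(left, right, m=3)
print([j for j, _ in cands[0]])
```

Each returned pair (left index, right index) would become one datapoint $v_i$, with the correlation kept as its match weight.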

We construct a hypergraph of edge cardinality $k = 3$
with these datapoints. For each feature point $a_l$ in
image $I_l$, all possible triangles are generated among
$a_l$ and its $k_{nn} = 5$ nearest neighbors. Any such triangle
among $\{a_{l_1}, \ldots, a_{l_k}\}$ has $m^k$ possible matching
triangles in image $I_r$ induced by the set of candidate
matches $\{\mu(a_{l_1}), \ldots, \mu(a_{l_k})\}$. This construction of
hypergraphs among matches follows that of [8] and [21],
except that [8] searches all possible triangles in image $I_r$
instead of only those induced by candidate
matches. The geometric similarity of these triangle
pairs is evaluated by the sum of squared differences of
the angles, similar to the tensor matching algorithm [8].
The parametric difference $\epsilon$ between triangles is
converted to a geometric similarity weight using $1 - \frac{\epsilon}{\delta}$, where
$\delta = 0.5$ for all experiments in this section.^2
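
The geometric weight can be sketched as below. The code assumes 2D points, angles measured in radians, non-degenerate triangles, and vertices listed in corresponding order; these are assumptions of the example rather than details stated in the text.

```python
import math

def triangle_angles(p, q, r):
    """Interior angles (radians) at vertices p, q, r of a non-degenerate triangle."""
    def angle_at(a, b, c):
        # Angle at vertex a between the rays a->b and a->c.
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - a[0], c[1] - a[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        norm = math.hypot(*v1) * math.hypot(*v2)
        return math.acos(max(-1.0, min(1.0, dot / norm)))
    return angle_at(p, q, r), angle_at(q, r, p), angle_at(r, p, q)

def geometric_weight(tri_left, tri_right, delta=0.5):
    """Return 1 - eps/delta, or None when the pair is discarded (eps > delta)."""
    eps = sum((a - b) ** 2
              for a, b in zip(triangle_angles(*tri_left), triangle_angles(*tri_right)))
    if eps > delta:
        return None
    return 1.0 - eps / delta

tri_left = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
tri_scaled = ((2.0, 2.0), (4.0, 2.0), (2.0, 4.0))  # same shape, shifted and scaled
tri_thin = ((0.0, 0.0), (10.0, 0.0), (5.0, 0.1))   # very different shape

w_same = geometric_weight(tri_left, tri_scaled)
w_thin = geometric_weight(tri_left, tri_thin)
print(w_same, w_thin)
```

Comparing angles makes the weight insensitive to translation and uniform scaling, which is why the shifted and scaled copy receives the maximum weight.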

The appearance similarity value is the normalized
correlation between two GB descriptors computed for
potential matching features. Each candidate match is
assigned a weight that reflects the quality of the match
computed by normalized correlation [15]. To compute
the overall similarity $\lambda_1(V^k)$ between two triangles, the
weights of the corresponding matches $\{\mu(a_{l_1}), \ldots, \mu(a_{l_k})\}$ are
multiplied with the geometric similarity weight
computed from the parametric difference between the two
triangles.
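
As a small worked example of this product, assuming match weights that already lie in $[0, 1]$ (the numbers are illustrative and `overall_similarity` is a hypothetical name):

```python
from math import prod

def overall_similarity(match_weights, geometric_weight):
    """lambda_1(V^k): product of the match weights and the geometric weight."""
    return geometric_weight * prod(match_weights)

# Three candidate matches forming one triangle pair.
lam1 = overall_similarity([0.9, 0.8, 0.95], geometric_weight=0.7)
lam0 = 1.0 - lam1  # likelihood of the non-matching hypothesis
print(lam1, lam0)
```

A single weak match or a poor geometric fit lowers the hyperedge weight, since every factor multiplies the result.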



^2 Triangle pairs with $\epsilon > \delta$ are discarded.

Figure 2. (Left to right) House, Hotel, Horse-Shear, Horse-Rotate: Mean and std deviation of incorrect matches.



We consider four sets of image pairs where, in each
pair, the two images are {20, 40, 60, 80} frames apart
from each other (also 100 for the Horse datasets). For each
set of image pairs, the first five pairs were selected to learn
the parameters for the proposed matching algorithm.
We learned the parameters for a discrete $g_c$ by the
maximum likelihood (ML) method (refer to Section 4.3.1).

The performance of our algorithm is compared 
against the following algorithms: 

1. Tensor matching method [8] (implementation
available at the author's website): the parameter
values, such as the number of triangles to be
generated, the number of nearest neighbors of each triangle,
and the distances, are tuned to produce the best
results in each of the experiments.

2. Graph matching of [15]: we used the exact
procedure described in the paper with the same
$m = 3$ candidate matches for each keypoint and
used 3 as the distance threshold to determine
the neighboring keypoints (also tuned for the best
result).

3. Learning graph matching [5]: the results of learning
both the linear and quadratic assignments have
been used for comparison.

Figure 2 shows the percentage of incorrect matches
produced by these methods and the proposed method. Some
qualitative results are supplied as supplementary material.

The results show that none of the spectral Graph
matching and Tensor matching techniques was able to
perform well on all of these datasets. On the other
hand, the proposed method, with learned cost functions,
is more robust and accurate than all other methods
on the House, Hotel and Horse-shear datasets. The
result of the learned Linear Assignment procedure of [5]
closely follows that of our method. However, learning
linear assignment produces unacceptably high error
rates (much higher than the proposed method) for the
Horse-rotate dataset. This is because Linear
Assignment learns the weight vector for descriptor
similarity for a candidate match. Unless the window
in which the descriptor is computed is also rotated,
the descriptor similarity would be too low in rotated
images for a weight vector to generate a correct match.
This observation supports the claim made in [16] that,
in general, Linear Assignment alone cannot result in
accurate matches. The proposed algorithm and Graph
matching [15] could not identify the correct matches
for larger rotational angles (>80 degrees) due to
inferior initial candidate matches.

These results attest to the advantage of using higher
order information and learning the cost function
for matching. Utilizing higher order information
consistently produced higher accuracy than learning
Quadratic Assignment in all but one dataset. The Tensor
matching algorithm, which uses higher order information
but does not learn from data, was not robust
across the different datasets either.^3 The reason for this
behavior was surmised in the introduction: the number
of subsets generated by a higher order algorithm is usually
large, with an imbalanced ratio of useful subsets. One
needs to learn the appropriate cost functions for accurate
labeling of the members of these subsets. However,
it is interesting to see that both Quadratic
Assignment [5] and Tensor matching [8] produced a perfect
matching for rotated images (Figure 2, rightmost
plot). Indeed, [8] also reports similar matching results
on synthetic 2D points.





Figure 3. Parameters learned by ML method, left: $w_1$, right: $w_0$.

In Figure 3, we show the discrete $g_c$ learned by the
ML algorithm for $c = 0, 1$. As expected, the learned
penalty functions strongly resemble smooth concave
($w_1$, left in Figure 3) and convex ($w_0$, right in Figure 3)

^3 In [8], the authors did not report the results on all possible
pairs of images. Results for one pair of images for each interval
on the House dataset were reported, and these values are the same as
the minimum error rates of our result.



functions. The forms of the $g_c$ functions also provide some
insight about the subsets generated for matching. A
convex penalty imposes 'lenient' penalties on lower
values of $\eta_1$, the number of label variables assuming the
opposite class, class 1. This penalty function would
be effective when there are many subsets comprising
very few (e.g., one) correct matches. For these subsets,
a convex $g_0$ would allow a few datapoints
within the subset to assume the opposite label 1.
Examining the matching triangles used for matching, one
can verify that there are indeed many subsets that
contain one correct match and two incorrect matches.
On the other hand, the triangles with all correct
matches are rare, and therefore the penalty function is
'strict' (i.e., concave) on the value of $\eta_0$.

More plots of such learned penalty functions, as 
well as non-discretized belief values (i.e., the soft as- 
signment vector) generated by inference algorithm and 
some qualitative matching results are presented as sup- 
plementary material. 

6.2. KTH Activity 

We applied our method on some KTH activity recog- 
nition data [5 . For this dataset, we chose three activ- 
ities, walking, jogging and hand waving and for each 
of these activities we randomly selected two sequences. 
The experimental setup is almost the same as above, except
that the features are detected using the Kadir-Brady (KB)
keypoint detector [12] on both images, i.e.,
we do not manually select keypoints on the images. For
each keypoint selected by the feature detector (KB) on
the left image, the goal is to find its best match on the
right image.

One of the objectives of this experiment is to show
the necessity of learning the penalty function instead
of employing predefined (linear) ones. We applied the
labeling algorithm with predefined linear penalty
functions and compared the results to show the
improvement achieved by learning $g_c$. For the learning
algorithm, discrete $g_c$ functions are learned using the
ML estimation procedure as before. All parameters
for both methods are the same for all the experiments
in this section. Sample output matches are shown in
Figure 4. The top row shows the output produced by
the proposed method using linear penalties, and the
bottom row shows the results produced by discrete
$g_c$ trained from data. The matching algorithm with
the learned penalty function was able to extract more
accurate matches than that with linear penalties.

Table 1 summarizes the quantitative matching
performances of these two methods. The results clearly
show that hypergraph labeling with a learned penalty
function consistently produces better results than the
same method with predefined linear penalties. It is
worth mentioning here that the proposed matching
algorithm was applied to the (spatially clustered)
keypoint locations detected by the KB detector, leading
to a variable number of feature locations in different
images. We manually counted the number of correct and
incorrect matches from the output for quantitative
performance evaluations.

The learned penalty functions for each of these
datasets closely resemble those shown in Figure 3;
please refer to the supplementary material for specific
plots. These learned optimal penalty functions are
clearly non-linear, which explains why predefined linear
penalty functions produce inferior matching results.

6.3. Caltech Aeroplane and Motorbike 

Finally, we show some more qualitative results
on Caltech objects, such as airplanes and motorbikes,
in Figure 5. The experimental setup is exactly
the same as that described in the last section. Notice that,
in this experiment, we are establishing correspondences
between two different instances of the same object
category, unlike the experiments described before.

7. Discussion 

In this paper, we propose a novel feature matching
algorithm based on higher order information among
the features. The feature correspondence problem is
formulated as a hypergraph node labeling problem. A
recent algorithm that models the higher order interaction
among the datapoints using a Markov network
is applied to address the labeling problem. We
describe how the associated cost function can be learned
from labeled data using existing graphical model training
algorithms. The results show that learning the cost
function makes the proposed matching algorithm more
robust than other pairwise and higher order methods.

This paper presents methods to learn the appropriate
cost functions (in terms of the penalty functions)
of a hypergraph node labeling algorithm [17]. Feature
correspondence is one significant application of the
supervised hypergraph labeling algorithm, but the learning
procedure can benefit any application of it. We
strongly believe learning penalty functions will improve
the performances of model estimation and object
localization demonstrated in [17].

The hypergraph labeling method could potentially be
applied to other problems where learning cost functions
could be advantageous. One such problem is object
boundary detection or image segmentation. We
performed a small experiment on natural images of the
Berkeley dataset. The description of the procedure and
sample results are shown in the supplementary material to
avoid confusion. These results suggest the method can
be used for segmentation problems, at least for specific
domains if not for natural images, with appropriately
chosen image features and model.

Figure 4. Improvement achieved by learning. Top: results of TT using a predefined linear penalty; bottom: matches after
learning. More correct correspondences are recovered by the learned penalty function.

Table 1. Average number of correct and incorrect matches (NOT percentages) found on the image pairs. The proposed
algorithm with learned penalty functions consistently produces more true positives with fewer false positives on all the
sequences.

Figure 5. Qualitative results on Caltech aeroplanes and motorbikes.



